White Wine Dataset Exploration by Yinglu Zhang

This report explores a dataset of 4,898 white wine samples. Each sample has a quality rating given by wine experts and 11 physicochemical attributes.

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

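The dataset contains 4,898 observations and 13 columns: a row index (X), 11 physicochemical variables, and the quality rating. Excluding the index, that leaves 12 variables.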
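
The step from 13 columns to 12 variables can be sketched as below. This is only an illustration: the file name in the commented-out line is an assumption (the report does not state it), and the stand-in data frame holds the first five rows of two columns copied from the str() output above.

```r
# Assumed file name; the report does not state it.
# df <- read.csv("wineQualityWhites.csv")

# Stand-in: first five rows of two columns, copied from the str() output above.
df <- data.frame(
  X = 1:5,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# X is only a row index, so drop it before counting variables.
df$X <- NULL

dim(df)
names(df)
```

On the full dataset the same step would leave 4898 rows and 12 columns.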

The fixed acidity (measured by tartaric acid concentration) of the white wine samples is normally distributed between 4 and 10 g / dm^3 with a peak at 6.8 g / dm^3.

The volatile acidity (measured by acetic acid concentration) of the white wine samples is normally distributed between 0.1 and 0.5 g / dm^3 with a peak at 0.25 g / dm^3.

The citric acid concentrations of the white wine samples are normally distributed between 0 and 0.75 g / dm^3 with a peak at 0.3 g / dm^3.

The residual sugar concentrations of the white wine samples cluster around 1-1.5 g / dm^3, with a long right tail extending from 1.5 to 20 g / dm^3. I therefore plotted the data on a log2 scale. The transformed distribution of residual sugar is bimodal, with a first peak around 1.5 g / dm^3 and a second peak at 8 g / dm^3.
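
A log2-scaled histogram of this kind could be drawn roughly as follows. This is a sketch that assumes ggplot2 was the plotting package (the references point to it); the bin count and axis breaks are arbitrary choices, and the ten values are the first rows shown in the str() output above, so the bimodal shape will not be visible on so small a sample.

```r
library(ggplot2)

# First ten residual.sugar values, copied from the str() output above.
sugar <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Histogram with a log2-transformed x-axis; bins and breaks are arbitrary.
p <- ggplot(sugar, aes(x = residual.sugar)) +
  geom_histogram(bins = 15) +
  scale_x_continuous(trans = "log2", breaks = c(1, 2, 4, 8, 16, 32)) +
  labs(x = "residual sugar (g / dm^3, log2 scale)", y = "count")

print(p)
```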

The chlorides (measured by sodium chloride concentration) of the white wine samples are normally distributed between 0.02 and 0.08 g / dm^3 with a peak at around 0.04 g / dm^3.

The free sulfur dioxide concentrations of the white wine samples are normally distributed between 0 and 75 mg / dm^3 with a peak at 30 mg / dm^3.

The total sulfur dioxide concentrations of the white wine samples are nearly normally distributed (a little skewed to the right) between 50 and 250 mg / dm^3 with a peak at 120 mg / dm^3.

The density of the white wine samples is slightly skewed to the left, ranging between 0.985 and 1 g / cm^3 with a peak at 0.9925 g / cm^3.

The pH of the white wine samples is normally distributed between 2.7 and 3.7 with a peak at 3.2.

The sulphates (measured by potassium sulphate concentration) of the white wine samples are nearly normally distributed (a little skewed to the right) between 0.2 and 1 g / dm^3 with a peak at 0.45 g / dm^3.

The alcohol concentration of the white wine samples is skewed to the right, ranging between 8% and 14% with a peak at around 9.5%.

The quality ratings of the white wine samples are skewed to the right, ranging from 3 to 9 with a peak at 6. Most samples have a rating between 5 and 7.

Next, I split out the white wine samples with high quality (top 25% of quality ratings) and low quality (bottom 25% of quality ratings), and compared the alcohol, residual sugar and chlorides in these two subsets.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
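
A quartile-based split like the one described above might look like the sketch below. The data frame `toy` and its values are made up for illustration and are not taken from the dataset, and the strict cut-offs are an assumption, since the report does not say how ties at the quartiles were handled.

```r
# Made-up ratings and alcohol values for illustration only; not from the dataset.
toy <- data.frame(
  quality = c(3L, 4L, 5L, 5L, 6L, 6L, 6L, 7L, 8L, 9L),
  alcohol = c(9.0, 9.2, 9.4, 9.1, 10.2, 10.5, 10.0, 11.8, 12.1, 12.4)
)

# First and third quartiles of the rating.
q <- quantile(toy$quality, probs = c(0.25, 0.75))

# Assumed cut-offs: strictly above the third quartile, strictly below the first.
high_quality <- subset(toy, quality > q[2])
low_quality  <- subset(toy, quality < q[1])

summary(high_quality$alcohol)
summary(low_quality$alcohol)
```

Because the ratings are integers with many ties, cut-offs like these do not select exactly 25% of the samples on either side. On the full dataset the quartiles are 5 and 6, so the strict version would keep ratings 7 to 9 as high quality and 3 to 4 as low quality.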

From the comparisons of alcohol concentrations in high- and low-quality white wine samples, I found that the high-quality white wines generally have higher alcohol concentrations (around 12%), while the low-quality white wines generally have lower alcohol concentrations (around 9%).

From the comparisons of chlorides concentrations in high- and low-quality white wine samples, I found that the high-quality white wines mostly have lower chlorides concentrations (around 0.035 g / dm^3), while the low-quality white wines mostly have higher chlorides concentrations (around 0.045 g / dm^3). However, the difference is not very large.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine samples in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Quality is treated as an ordered factor with the following levels:

(worst) – (best)

3,4,5,6,7,8,9

Other observations:

- Most white wine samples have a density around 0.995
- 75% of the white wine samples have under 10 g / dm^3 of residual sugar
- The median alcohol concentration of the white wine samples is 10.4%
- Most of the white wine samples have chlorides level of 0.05 g / dm^3 or less
- The median quality rating of the white wine samples is 6

What is/are the main feature(s) of interest in your dataset?

The quality rating is the main feature of interest in my dataset because it’s the output variable based on the other variables as input. I’d like to determine which features are best for predicting the quality of a white wine sample based on its other features.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think fixed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol probably contribute to the quality rating of a white wine sample. I think alcohol probably contributes the most to the quality rating after some investigations and comparisons of the high- and low-quality wine samples.

Did you create any new variables from existing variables in the dataset?

I created one new variable, ‘quality_level’, which turns the quality rating ‘quality’ into a factor variable with 7 levels.
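
A conversion of this kind can be sketched as follows. The explicit `levels = 3:9` argument is an assumption made so that all seven ratings appear as levels even when a subset lacks some of them; the ten ratings are the first rows shown in the str() output above.

```r
# Ratings of the first ten samples, copied from the str() output above.
quality <- c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L)

# Ordered factor covering the seven ratings, 3 (worst) to 9 (best).
quality_level <- factor(quality, levels = 3:9, ordered = TRUE)

nlevels(quality_level)
levels(quality_level)
```

An ordered factor keeps the ranking of the ratings, which matters for grouped plots such as box plots where the groups should appear from worst to best.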

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the right-skewed residual sugar distribution by plotting the x-axis on a log2 scale. The transformed distribution of residual sugar appears bimodal, with the concentration peaking around 1.5 g / dm^3 and then again at around 8 g / dm^3.

Bivariate Plots Section

Quality correlates strongly with alcohol. It also correlates moderately with fixed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide and total sulfur dioxide.

Density correlates strongly with alcohol and residual sugar. It also correlates moderately with chlorides, free sulfur dioxide, and total sulfur dioxide.

These mentioned variables also correlate moderately with one another. Therefore, I want to look closer at box plots and scatter plots involving quality and some other variables like density, fixed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide and total sulfur dioxide.
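
Pairwise correlations like those summarised above can be computed with a correlation matrix. The sketch below assumes Pearson correlation (the report does not name the method) and uses only the first ten rows shown in the str() output, so its numbers will differ from those of the full dataset. Quality is left out of the excerpt because all ten rows have a rating of 6, which makes its correlation undefined.

```r
# First ten rows of five columns, copied from the str() output above.
excerpt <- data.frame(
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Pearson correlation between every pair of columns, rounded for reading.
cor_matrix <- round(cor(excerpt, method = "pearson"), 2)
cor_matrix
```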

Among low-quality white wine samples (quality between 3 and 5), the alcohol concentration is negatively correlated with the quality rating. However, among higher quality white wine samples (quality between 5 and 9), the alcohol concentration is positively correlated with the quality rating. The higher the alcohol concentration, the higher the quality.

The volatile acidity level is roughly constant among higher-quality white wine samples (quality >= 6). Among lower-quality white wine samples (quality < 6), the level varies more and is generally higher than that of higher-quality samples. However, no strong correlation is observed.

The fixed acidity level is constant among the white wine samples. No strong correlation is observed.

The residual sugar concentration is low among white wine samples of very low quality (quality < 5) and very high quality (quality > 6), but high among samples around the median quality rating (quality of 5-6).

Among low-quality white wine samples (quality between 3 and 5), the chlorides concentration is positively correlated with quality rating. However, among higher-quality white wine samples (quality between 5 and 9), the chlorides concentration is negatively correlated with the quality rating. The lower the chlorides concentration, the higher the quality.

The total sulfur dioxide level is roughly constant among higher-quality white wine samples (quality >= 6) and varies more among lower-quality white wine samples (quality < 6). However, no strong correlation is observed.

The free sulfur dioxide level is mostly constant among white wine samples with quality >= 5 and is lower among low-quality white wine samples (quality < 5). However, no strong correlation is observed.

The density of white wine samples is strongly negatively correlated with the alcohol concentration. This could be explained by the density difference between alcohol (0.789 g / cm^3) and water (1 g / cm^3).
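
The direction of this effect can be checked with a rough ideal-mixing estimate. The sketch below treats wine as a mixture of only ethanol and water and assumes the alcohol percentage is by volume; both are simplifications, and the function name `mix_density` is made up for this example.

```r
# Densities in g / cm^3: ethanol as quoted above, water taken as 1.
rho_ethanol <- 0.789
rho_water   <- 1.0

# Ideal-mixing estimate: volume-weighted average of the two densities.
# Ignores sugar, acids and the volume contraction of real ethanol-water mixtures.
mix_density <- function(alcohol_pct) {
  v <- alcohol_pct / 100
  v * rho_ethanol + (1 - v) * rho_water
}

# Minimum, median and maximum alcohol levels from the summary table.
est <- mix_density(c(8.0, 10.4, 14.2))
round(est, 4)
```

The estimate falls as the alcohol level rises, which matches the sign of the observed correlation. The measured densities (0.9871 to 1.039 g / cm^3) are higher than these estimates, which is consistent with dissolved substances such as residual sugar raising the density.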

There is a moderately positive correlation between the density and residual sugar concentration of white wine samples.

No strong correlation is observed between density and chlorides, as the scatter points are distributed evenly at different chlorides concentrations.

No strong correlation is observed between density and free sulfur dioxide, as the scatter points are distributed evenly at different free sulfur dioxide concentrations.

There is a positive correlation between the density and total sulfur dioxide concentration of white wine samples.

No strong correlation is observed between pH and citric acid, as the scatter points are distributed evenly at different citric acid concentrations. This is quite surprising, because I expected citric acid to contribute to the acidity of the white wine samples and therefore to lower the pH values.

There is a negative correlation between the alcohol concentration and the chlorides concentration of white wine samples.

No strong correlation is observed between alcohol and residual sugar concentrations, as the scatter points are distributed evenly at different residual sugar concentrations.

No strong correlation is observed between alcohol and free sulfur dioxide concentrations, as the scatter points are distributed evenly at different free sulfur dioxide concentrations.

No strong correlation is observed between alcohol and total sulfur dioxide concentrations, as the scatter points are distributed evenly at different total sulfur dioxide concentrations.

No strong correlation is observed between residual sugar and free sulfur dioxide concentrations, although the residual sugar level is slightly higher at higher free sulfur dioxide concentrations.

No strong correlation is observed between residual sugar and total sulfur dioxide concentrations, although the residual sugar level is slightly higher at higher total sulfur dioxide concentrations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

These density plots elaborate on the trends that were seen in the box plots earlier. White wine samples with higher levels of chlorides and lower levels of alcohol receive lower quality ratings, while white wine samples with lower levels of chlorides and higher levels of alcohol receive higher quality ratings.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The quality of white wine samples correlates strongly with alcohol and chlorides concentrations. Among low-quality white wine samples (quality between 3 and 5), the alcohol concentration is negatively correlated with the quality rating, while the chlorides concentration is positively correlated with the quality rating. Among higher-quality white wine samples (quality between 5 and 9), the alcohol concentration is positively correlated with the quality rating, while the chlorides concentration is negatively correlated with the quality rating.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes. I observed that the density of the white wine samples is negatively correlated with the alcohol concentrations and positively correlated with residual sugar concentrations and total sulfur dioxide concentrations. Also, the alcohol concentration is negatively correlated with the chlorides concentrations.

What was the strongest relationship you found?

The strongest relationship I found was the negative correlation between the alcohol concentration and the density of the white wine samples. The correlations between quality and the alcohol and chlorides concentrations are also relatively strong.

Multivariate Plots Section

This scatter plot confirms that the alcohol is negatively correlated with chlorides concentration. Lower chlorides concentration is observed at higher alcohol concentrations, and higher chlorides concentration is observed at lower alcohol concentrations. However, the chlorides concentration does not correlate with the quality rating of white wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This scatter plot confirms that the alcohol is negatively correlated with residual sugar concentration. Lower residual sugar concentration is observed at higher alcohol concentrations, and higher residual sugar concentration is observed at lower alcohol concentrations. However, the residual sugar concentration does not correlate with the quality rating of white wine.

Now, I’d like to see if the proportion of free sulfur dioxide in total sulfur dioxide is correlated with the quality of white wine. So I created a new variable ‘free.sulfur.dioxide.prop’, which is the free sulfur dioxide concentration divided by the total sulfur dioxide concentration. Then I used a box plot to explore its relationship with the quality ratings.
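
The new variable is a simple element-wise ratio. A minimal sketch, using the first ten rows of the two sulfur dioxide columns copied from the str() output above (the data frame name `so2` is made up for this example):

```r
# First ten rows of the two columns, copied from the str() output above.
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Proportion of free sulfur dioxide in total sulfur dioxide.
so2$free.sulfur.dioxide.prop <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

round(so2$free.sulfur.dioxide.prop, 3)
```

Free sulfur dioxide is part of the total, so the proportion should lie between 0 and 1 for every sample.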

The proportion of free sulfur dioxide in total sulfur dioxide is positively correlated with the quality ratings of white wine samples. The higher the proportion, the higher the quality ratings.

This scatter plot confirms that the alcohol is negatively correlated with chlorides concentration. Lower chlorides concentration is observed at higher alcohol concentrations, and higher chlorides concentration is observed at lower alcohol concentrations. However, the chlorides concentration does not correlate with the proportion of free sulfur dioxide in total sulfur dioxide of white wine.

This scatter plot confirms that the alcohol is negatively correlated with residual sugar concentration. Lower residual sugar concentration is observed at higher alcohol concentrations, and higher residual sugar concentration is observed at lower alcohol concentrations. However, the residual sugar concentration does not correlate with the proportion of free sulfur dioxide in total sulfur dioxide of white wine.

The plots suggest that we can build a linear model and use those variables in the linear model to predict the quality rating of a white wine.
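
Building the model by adding one predictor at a time can be sketched as below, following the formulas shown in the model calls. The data here are simulated stand-ins with invented values and coefficients, so the resulting numbers say nothing about the wine dataset; the sketch only shows the mechanics of fitting nested models.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; values and coefficients are invented for illustration.
sim <- data.frame(
  alcohol              = runif(n, 8, 14),
  free.sulfur.dioxide  = runif(n, 5, 60),
  total.sulfur.dioxide = runif(n, 80, 200),
  residual.sugar       = runif(n, 1, 20),
  chlorides            = runif(n, 0.02, 0.08)
)
sim$quality <- 2 + 0.3 * sim$alcohol + rnorm(n, sd = 0.8)

# Nested models: each one adds a single term to the previous model.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + I(free.sulfur.dioxide / total.sulfur.dioxide))
m3 <- update(m2, . ~ . + residual.sugar)
m4 <- update(m3, . ~ . + chlorides)

# R-squared of each model; it cannot decrease as terms are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4),
             function(m) summary(m)$r.squared)
r2
```

Because R-squared never decreases when predictors are added, criteria that penalise model size, such as adjusted R-squared, AIC or BIC, are more informative when comparing nested models.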

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = df)
## m2: lm(formula = I(quality) ~ I(alcohol) + I(free.sulfur.dioxide/total.sulfur.dioxide), 
##     data = df)
## m3: lm(formula = I(quality) ~ I(alcohol) + I(free.sulfur.dioxide/total.sulfur.dioxide) + 
##     residual.sugar, data = df)
## m4: lm(formula = I(quality) ~ I(alcohol) + I(free.sulfur.dioxide/total.sulfur.dioxide) + 
##     residual.sugar + chlorides, data = df)
## 
## =======================================================================================================
##                                                     m1            m2            m3            m4       
## -------------------------------------------------------------------------------------------------------
##   (Intercept)                                      2.582***      2.256***      1.786***      2.038***  
##                                                   (0.098)       (0.099)       (0.116)       (0.134)    
##   I(alcohol)                                       0.313***      0.306***      0.341***      0.326***  
##                                                   (0.009)       (0.009)       (0.010)       (0.011)    
##   I(free.sulfur.dioxide/total.sulfur.dioxide)                    1.600***      1.518***      1.517***  
##                                                                 (0.119)       (0.119)       (0.119)    
##   residual.sugar                                                               0.019***      0.018***  
##                                                                               (0.002)       (0.002)    
##   chlorides                                                                                 -2.043***  
##                                                                                             (0.547)    
## -------------------------------------------------------------------------------------------------------
##   R-squared                                        0.19          0.22          0.23          0.23      
##   adj. R-squared                                   0.19          0.22          0.23          0.23      
##   sigma                                            0.80          0.78          0.78          0.78      
##   F                                             1146.40        684.10        480.70        364.96      
##   p                                                0.00          0.00          0.00          0.00      
##   Log-likelihood                               -5839.39      -5750.99      -5722.16      -5715.19      
##   Deviance                                      3112.26       3001.92       2966.78       2958.36      
##   AIC                                          11684.78      11509.99      11454.31      11442.39      
##   BIC                                          11704.27      11535.97      11486.80      11481.37      
##   N                                             4898          4898          4898          4898         
## =======================================================================================================
## 
## Call:
## lm(formula = I(quality) ~ I(alcohol) + I(free.sulfur.dioxide/total.sulfur.dioxide) + 
##     residual.sugar + chlorides, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6089 -0.5241 -0.0311  0.4732  3.1069 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                  2.037785   0.134484  15.153
## I(alcohol)                                   0.326321   0.010887  29.974
## I(free.sulfur.dioxide/total.sulfur.dioxide)  1.517034   0.118941  12.755
## residual.sugar                               0.017975   0.002474   7.267
## chlorides                                   -2.042796   0.547321  -3.732
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## I(alcohol)                                   < 2e-16 ***
## I(free.sulfur.dioxide/total.sulfur.dioxide)  < 2e-16 ***
## residual.sugar                              4.27e-13 ***
## chlorides                                   0.000192 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7776 on 4893 degrees of freedom
## Multiple R-squared:  0.2298, Adjusted R-squared:  0.2292 
## F-statistic:   365 on 4 and 4893 DF,  p-value: < 2.2e-16

The variables in this linear model can account for 22.98% of the variance in the quality ratings of white wine samples. This model could be improved to be more accurate.
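
The reported figures can be cross-checked against each other using only numbers from the model output above. For a linear model the "Deviance" row is the residual sum of squares (RSS), R-squared is 1 - RSS / TSS, and the residual standard error is sqrt(RSS / residual degrees of freedom). The total sum of squares (TSS) below is recovered from rounded values, so it is approximate.

```r
# Values copied from the model table and summary above.
rss <- c(m1 = 3112.26, m2 = 3001.92, m3 = 2966.78, m4 = 2958.36)  # "Deviance" row
r2_m4 <- 0.2298        # Multiple R-squared of m4
df_resid_m4 <- 4893    # residual degrees of freedom of m4

# R^2 = 1 - RSS / TSS, so TSS can be recovered from m4 (approximately).
tss <- rss[["m4"]] / (1 - r2_m4)

# R-squared of each model from its RSS and the shared TSS.
r2 <- 1 - rss / tss
round(r2, 2)

# Residual standard error = sqrt(RSS / residual df).
sigma_m4 <- sqrt(rss[["m4"]] / df_resid_m4)
round(sigma_m4, 4)
```

The rounded results agree with the table (0.19, 0.22, 0.23, 0.23) and with the residual standard error of 0.7776. Most of the explained variance already comes from alcohol alone; the three added predictors raise R-squared by only about four percentage points.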

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Besides confirming the relationships between alcohol, chlorides and residual sugar that were observed in the previous sections, I found that the proportion of free sulfur dioxide in total sulfur dioxide is positively correlated with the quality ratings of white wine samples. The higher the proportion, the higher the quality ratings.

Were there any interesting or surprising interactions between features?

I think the correlation between the quality rating and the proportion of free sulfur dioxide in total sulfur dioxide is quite interesting. Looked at separately, neither variable correlates strongly with quality, but the ratio of the two does correlate with the quality ratings of the white wine samples.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear model starting from the quality rating and the alcohol concentration and then added other variables (free sulfur dioxide proportion, residual sugar, chlorides).

The variables in the linear model account for 22.98% of the variance in the quality rating of white wine. The addition of the variables to the model slightly improves the R^2 value. However, the model still needs to be improved to better predict the quality rating based on the other variables.


Final Plots and Summary

Plot One

Description One

The quality of white wine samples correlates strongly with alcohol concentrations. Among low-quality white wine samples (quality between 3 and 5), the alcohol concentration is negatively correlated with quality, while among higher quality white wine samples (quality between 5 and 9), the alcohol concentration is positively correlated with quality.

Plot Two

Description Two

The proportion of free sulfur dioxide in total sulfur dioxide is positively correlated with the quality of white wine samples. The higher the proportion of free sulfur dioxide, the higher the quality.

Plot Three

Description Three

The alcohol concentration is negatively correlated with chlorides and residual sugar concentrations. However, the chlorides and residual sugar concentrations do not correlate directly with the quality of white wine.


Reflection

The white wine dataset contains information on almost 5000 white wine samples with 12 variables (quality rating by wine experts and 11 physicochemical properties) from 2009. I started by exploring each variable separately, and then I looked into the relationships between white wine quality and other variables. In addition, I also explored the interesting relationships I discovered among other variables. Eventually, I created a linear model to predict the quality of white wine.

There was an obvious correlation between the alcohol concentration of white wine and its quality. Surprisingly, I also discovered that there is a positive correlation between the proportion of free sulfur dioxide in the white wine and its quality. I also discovered that the density of white wine is strongly correlated with its alcohol concentration, and moderately with its chlorides and residual sugar concentrations. In addition, the chlorides and residual sugar concentrations are correlated with the alcohol concentration of white wine. I built a linear model including all the white wine samples because there is no missing data for any of the samples. The model was able to account for 23% of the variance in the quality ratings.

Clearly, the model still needs to be improved. I think the limitations of the dataset could be the sample size and the lack of features. Since there are only around 5000 samples in this dataset, it’s relatively hard to fit a linear model to predict the white wine quality because the trend of some variables might not be significant. Also, there are only 11 variables of white wine physicochemical properties in this dataset, and I only found 3 that are correlated with quality, so the model might have high bias and underfit the data. Therefore, to investigate this question further, I would obtain a larger dataset with many more samples and more variables, in order to build a model that fits the data better.

Reference

1: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

2: http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software#what-is-correlation-matrix

3: http://ggplot2.tidyverse.org/reference/qplot.html

4: http://ggplot2.tidyverse.org/reference/geom_density.html

5: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization#change-box-plot-colors-by-groups

6: http://www.sthda.com/english/wiki/ggplot2-axis-ticks-a-guide-to-customize-tick-marks-and-labels